Abstract. This paper studies sparse density estimation via $\ell_1$ penalization (SPADES). We 
focus on estimation in high-dimensional mixture models and nonparametric adaptive den- 
sity estimation. We show, respectively, that SPADES can recover, with high probability, 
the unknown components of a mixture of probability densities and that it yields minimax 
adaptive density estimates. These results are based on a general sparsity oracle inequality 
that the SPADES estimates satisfy. 
1. Introduction

Let $X_1, \dots, X_n$ be independent random variables with common unknown density $f$ in
$\mathbb{R}^d$. Let $\{f_1, \dots, f_M\}$ be a finite set of functions with $f_j \in L_2(\mathbb{R}^d)$, $j = 1, \dots, M$, called a
dictionary. We consider estimators of $f$ that belong to the linear span of $\{f_1, \dots, f_M\}$. We
will be particularly interested in the case where $M \gg n$. Denote by $f_\lambda$ the linear combinations

$$f_\lambda(x) = \sum_{j=1}^M \lambda_j f_j(x), \qquad \lambda = (\lambda_1, \dots, \lambda_M) \in \mathbb{R}^M.$$

Let us mention some examples where such estimates are of importance. 
• Estimation in sparse mixture models. Assume that the density $f$ can be represented
as a finite mixture $f = f_{\lambda^*}$ where $f_j$ are known probability densities and $\lambda^*$ is a vector
of mixture probabilities. The number $M$ can be very large, much larger than the
sample size $n$, but we believe that the representation is sparse, i.e., that very few
coordinates of $\lambda^*$ are non-zero, with indices corresponding to a set $I^* \subseteq \{1, \dots, M\}$.
Our goal is to estimate the weight vector $\lambda^*$ by a vector $\hat\lambda$ that adapts to this unknown
sparsity and to identify $I^*$, with high probability.

• Adaptive nonparametric density estimation. Assume that the density $f$ is a smooth
function, and $\{f_1, \dots, f_M\}$ are the first $M$ functions from a basis in $L_2(\mathbb{R}^d)$. If the
basis is orthonormal, a natural idea is to estimate $f$ by an orthogonal series estimator
which has the form $f_{\tilde\lambda}$ with $\tilde\lambda$ having the coordinates $\tilde\lambda_j = n^{-1}\sum_{i=1}^n f_j(X_i)$. However,
it is well known that such estimators are very sensitive to the choice of $M$, and a
data-driven selection of $M$ or thresholding is needed to achieve adaptivity (cf., e.g.,
[30], [21], [6]); moreover, these methods have been applied with $M \le n$. We would like
to cover more general problems where the system $\{f_j\}$ is not necessarily orthonormal,
not even necessarily a basis, and $M$ is not necessarily smaller than $n$, but an estimate of
the form $f_{\hat\lambda}$ still achieves, adaptively, the optimal rates of convergence.

• Aggregation of density estimators. Assume now that $f_1, \dots, f_M$ are some preliminary
estimators of $f$ constructed from a training sample independent of $(X_1, \dots, X_n)$, and
we would like to aggregate $f_1, \dots, f_M$. This means that we would like to construct
a new estimator, the aggregate, which is approximately as good as the best among
$f_1, \dots, f_M$ or approximately as good as the best linear or convex combination of
$f_1, \dots, f_M$. General notions of aggregation and optimal rates are introduced in [27],
[33]. Aggregation of density estimators is discussed in [31], [29], [21] and more recently
in [5] where one can find further references. The aggregates that we have in mind
here are of the form $f_{\tilde\lambda}$ with suitably chosen weights $\tilde\lambda = \tilde\lambda(X_1, \dots, X_n) \in \mathbb{R}^M$.

In this paper, we suggest a data-driven choice of $\hat\lambda$ that can be used in all the examples
mentioned above and also more generally. We define $\hat\lambda$ as a minimizer of an $\ell_1$-penalized
criterion, that we call SPADES (SPArse Density Estimation). This method was introduced
in [11]. The idea of $\ell_1$-penalized estimation is widely used in the statistical literature, mainly
in linear regression where it is usually referred to as the Lasso criterion |32[ \T2\ [TS] [T8l [26] .
For Gaussian sequence models or for regression with orthogonal design matrix the Lasso
is equivalent to soft thresholding [TU [Mj. Model selection consistency of Lasso-type
linear regression estimators is treated in many papers including pU HOj EU Hll [25] . Recently,
$\ell_1$-penalized methods have been extended to nonparametric regression with general fixed or
random design [51 [Qj [IHl H] , as well as to some classification and other more general prediction
type models [22l [23l [Ml [7] .

In this paper we show that $\ell_1$-penalized techniques can also be successfully used in density
estimation. Section 2 gives the construction of the SPADES estimates, and Section 3 shows
that they satisfy general oracle inequalities. In the remainder of the paper we
discuss the implications of these results for two particular problems, identification of mixture
components and adaptive nonparametric density estimation. For the application of SPADES
in aggregation problems we refer to [11].

Section 4 is devoted to mixture models. A vast amount of literature exists on estimation 
in mixture models, especially when the number of components is known; see e.g. [36] for 
examples involving the EM algorithm. The literature on determining the number of mixture 
components is still developing, and we will focus on this aspect here. Recent works on the 
selection of the number of components (mixture complexity) are [20], [2]. A consistent selection
procedure specialized to Gaussian mixtures is suggested in [20]. The method of [20] relies on
comparing a nonparametric kernel density estimator with the best parametric fit of various
given mixture complexities. Nonparametric estimators based on the combinatorial density
method (see [13]) are studied in [2, 3]. These can be applied to estimating consistently the
number of mixture components, when the components have known functional form. Both
[20] and [2] can become computationally infeasible when $M$, the number of candidate components,
is large. The method proposed here bridges this gap and guarantees correct identification of
the mixture components with probability close to 1.

In Section 4 we begin by giving conditions under which the mixture weights can be esti- 
mated accurately, with probability close to 1. This is an intermediate result that allows us 
to obtain the main result of Section 4, correct identification of the mixture components. We 
show that in identifiable mixture models, if the mixture weights are above the noise level,
then the components of the mixture can be recovered with probability larger than $1 - \varepsilon$, for
any given small $\varepsilon$. Our results are non-asymptotic: they hold for any $M$ and $n$. Since the
emphasis here is on correct component selection, rather than optimal density estimation, the
tuning sequence that accompanies the $\ell_1$ penalty needs to be slightly larger than the one
used for good prediction. The same phenomenon has been noted for $\ell_1$-penalized estimation
in regression and generalized regression models, see, e.g., [7].

Section 5 uses the oracle inequalities of Section 3 to show that SPADES estimates adap- 
tively achieve optimal rates of convergence (up to a logarithmic factor) simultaneously on 
a large scale of functional classes, such as Holder, Sobolev or Besov classes, as well as on 
the classes of sparse densities, i.e., densities having only a finite, but unknown, number of 
non-zero wavelet coefficients. 

2. Definition of SPADES 

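Consider the $L_2(\mathbb{R}^d)$ norm

$$\|g\| = \Big(\int_{\mathbb{R}^d} g^2(x)\,dx\Big)^{1/2}$$

associated with the inner product

$$\langle g, h\rangle = \int_{\mathbb{R}^d} g(x)h(x)\,dx$$

for $g, h \in L_2(\mathbb{R}^d)$. Note that if the density $f$ belongs to $L_2(\mathbb{R}^d)$ and $X$ has the same
distribution as $X_1$, we have, for any $g \in L_2$,

$$\langle g, f\rangle = \mathbb{E}g(X),$$

where the expectation is taken under $f$. Moreover

(2.1)   $\|f - g\|^2 = \|f\|^2 + \|g\|^2 - 2\langle g, f\rangle = \|f\|^2 + \|g\|^2 - 2\mathbb{E}g(X).$

In view of identity (2.1), minimizing $\|f_\lambda - f\|^2$ in $\lambda$ is the same as minimizing

$$\gamma(\lambda) = -2\mathbb{E}f_\lambda(X) + \|f_\lambda\|^2.$$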

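The function $\gamma(\lambda)$ depends on $f$ but can be approximated by its empirical counterpart

$$\hat\gamma(\lambda) = -\frac{2}{n}\sum_{i=1}^n f_\lambda(X_i) + \|f_\lambda\|^2.$$

This motivates the use of $\hat\gamma = \hat\gamma(\lambda)$ as the empirical criterion, see, for instance, (UJ \5U[ I37| .
We define the penalty

(2.2)   $\mathrm{pen}(\lambda) = 2\sum_{j=1}^M \omega_j|\lambda_j|$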
with weights $\omega_j$ to be specified later, and we propose the following data-driven choice of $\lambda$:

(2.3)   $\hat\lambda = \arg\min_{\lambda\in\mathbb{R}^M}\{\hat\gamma(\lambda) + \mathrm{pen}(\lambda)\} = \arg\min_{\lambda\in\mathbb{R}^M}\Big\{-\frac{2}{n}\sum_{i=1}^n f_\lambda(X_i) + \|f_\lambda\|^2 + 2\sum_{j=1}^M \omega_j|\lambda_j|\Big\}.$

Our estimator of the density $f$, which we will call the SPADES estimator, is defined by

$$f^\spadesuit(x) = f_{\hat\lambda}(x).$$

It is easy to see that, for an orthonormal system $\{f_j\}$, the SPADES estimator coincides with
the soft thresholding estimator whose components are of the form $\hat\lambda_j = (1 - \omega_j/|\tilde\lambda_j|)_+\tilde\lambda_j$
where $\tilde\lambda_j = n^{-1}\sum_{i=1}^n f_j(X_i)$ and $x_+ = \max(0, x)$. We see that in this case $\omega_j$ is the threshold
for the $j$th component of a preliminary estimator $\tilde\lambda = (\tilde\lambda_1, \dots, \tilde\lambda_M)$.
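
As a minimal sketch of the orthonormal case just described, the following Python snippet applies the soft-thresholding rule to the empirical coefficients $n^{-1}\sum_i f_j(X_i)$. The function name `soft_threshold_spades` and the array layout (rows are observations, columns are dictionary functions evaluated at the data) are assumptions made for illustration; they do not appear in the paper.

```python
import numpy as np

def soft_threshold_spades(F, omega):
    """Soft-thresholded coefficients for an orthonormal dictionary.

    F     : (n, M) array, F[i, j] = f_j(X_i)  (hypothetical layout)
    omega : (M,) array of non-negative thresholds omega_j
    """
    F = np.asarray(F, dtype=float)
    omega = np.asarray(omega, dtype=float)
    # Preliminary estimator: empirical mean of each dictionary function.
    lam_tilde = F.mean(axis=0)
    # sign(x) * max(|x| - omega, 0) equals (1 - omega/|x|)_+ * x for x != 0
    # and avoids a division by zero when x == 0.
    return np.sign(lam_tilde) * np.maximum(np.abs(lam_tilde) - omega, 0.0)
```

Coefficients whose empirical mean does not exceed the threshold in absolute value are set exactly to zero, which is what produces sparsity.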

The SPADES estimate can be easily computed by convex programming even if $M \gg n$.
It retains the desirable theoretical properties of other density estimators, the computation
of which may become problematic for $M \gg n$. We refer to [13] for a thorough overview
of combinatorial methods in density estimation, to [M| for density estimation using support
vector machines and to [6] for density estimates using penalties proportional to the dimension.
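
One simple way to carry out this convex program is cyclic coordinate descent on the criterion $-2b^\top\lambda + \lambda^\top\Psi\lambda + 2\sum_j\omega_j|\lambda_j|$, where $b_j = n^{-1}\sum_i f_j(X_i)$ and $\Psi$ is the Gram matrix of the dictionary. The sketch below is an illustrative solver choice, not the algorithm used by the authors; the function name and arguments are hypothetical, and it assumes the Gram matrix has a strictly positive diagonal.

```python
import numpy as np

def spades_coordinate_descent(b, gram, omega, n_sweeps=500, tol=1e-10):
    """Minimize -2 b'lam + lam' gram lam + 2 sum_j omega_j |lam_j|.

    b     : (M,) empirical means of the dictionary functions (assumed given)
    gram  : (M, M) Gram matrix <f_j, f_k> with positive diagonal (assumed given)
    omega : (M,) penalty weights
    """
    b = np.asarray(b, dtype=float)
    gram = np.asarray(gram, dtype=float)
    omega = np.asarray(omega, dtype=float)
    lam = np.zeros_like(b)
    for _ in range(n_sweeps):
        max_change = 0.0
        for j in range(b.size):
            # Partial residual: remove the contribution of all other coordinates.
            z = b[j] - gram[j] @ lam + gram[j, j] * lam[j]
            # Exact one-dimensional minimizer: soft threshold, then rescale.
            new = np.sign(z) * max(abs(z) - omega[j], 0.0) / gram[j, j]
            max_change = max(max_change, abs(new - lam[j]))
            lam[j] = new
        if max_change < tol:
            break
    return lam
```

With an identity Gram matrix a single sweep reproduces the soft-thresholding estimator; for a general dictionary the sweeps are repeated until the coordinates stop moving.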

3. Oracle inequalities for SPADES 
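3.1. Preliminaries. For any $\lambda \in \mathbb{R}^M$, let

$$J(\lambda) = \{j \in \{1, \dots, M\} : \lambda_j \ne 0\}$$

be the set of indices corresponding to non-zero components of $\lambda$ and

$$M(\lambda) = |J(\lambda)| = \sum_{j=1}^M I\{\lambda_j \ne 0\}$$

its cardinality. Here $I\{\cdot\}$ denotes the indicator function. Furthermore, set

$$\sigma_j^2 = \mathrm{Var}(f_j(X_1)), \qquad L_j = \|f_j\|_\infty$$

for $1 \le j \le M$, where $\mathrm{Var}(\zeta)$ denotes the variance of random variable $\zeta$ and $\|\cdot\|_\infty$ is the
$L_\infty(\mathbb{R}^d)$ norm.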

We will prove sparsity oracle inequalities for the estimator $\hat\lambda = \hat\lambda(\omega_1, \dots, \omega_M)$, provided
the weights $\omega_j$ are chosen large enough. We first consider a simple choice:

(3.1)   $\omega_j = 4L_j r(\delta/2)$,

where $0 < \delta < 1$ is a user-specified parameter and

(3.2)   $r(\delta) = r(M, n, \delta) = \sqrt{\frac{\log(M/\delta)}{n}}.$
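
For concreteness, the simple weights can be computed as in the sketch below. It assumes the reading $\omega_j = 4L_j r(\delta/2)$ with $r(M, n, \delta) = \sqrt{\log(M/\delta)/n}$, which agrees with the value $r(\delta/2) = \{\log(2M/\delta)/n\}^{1/2}$ quoted in Section 4; the function names are hypothetical.

```python
import math

def r_rate(M, n, delta):
    """Assumed form of r(M, n, delta) = sqrt(log(M / delta) / n)."""
    return math.sqrt(math.log(M / delta) / n)

def simple_weights(sup_norms, n, delta):
    """Weights omega_j = 4 * L_j * r(delta / 2), with L_j = sup-norm of f_j."""
    M = len(sup_norms)
    r = r_rate(M, n, delta / 2.0)
    return [4.0 * L_j * r for L_j in sup_norms]
```

The weights shrink at the rate $\sqrt{\log M / n}$, so the dictionary size $M$ enters only logarithmically.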

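The oracle inequalities that we prove below hold with a probability of at least $1 - \delta$ and are
non-asymptotic: they are valid for all integers $M$ and $n$. The first of these inequalities is
established under a coherence condition on the "correlations"

$$\rho_M(i, j) = \frac{\langle f_i, f_j\rangle}{\|f_i\|\,\|f_j\|}, \qquad i, j = 1, \dots, M.$$

For $\lambda \in \mathbb{R}^M$, we define a local coherence number (called maximal local coherence) by

$$\rho(\lambda) = \max_{i\in J(\lambda)}\max_{j\ne i}|\rho_M(i, j)|,$$

and we also define

$$F(\lambda) = \max_{j\in J(\lambda)}\frac{\omega_j}{r(\delta/2)\|f_j\|}$$

and

$$G = \max_{1\le j\le M}\frac{r(\delta/2)\|f_j\|}{\omega_j}.$$

3.2. Main results.
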

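Theorem 1. Assume that $L_j < \infty$ for $1 \le j \le M$. Then with probability at least $1 - \delta$, for
all $\lambda \in \mathbb{R}^M$ that satisfy

(3.3)   $16\,G F(\lambda)\rho(\lambda)M(\lambda) \le 1$

and all $\alpha > 1$, we have the following oracle inequality:

$$\|f^\spadesuit - f\|^2 + \frac{\alpha}{2(\alpha - 1)}\sum_{j=1}^M \omega_j|\hat\lambda_j - \lambda_j| \le \frac{\alpha + 1}{\alpha - 1}\|f_\lambda - f\|^2 + \frac{8\alpha^2}{\alpha - 1}\{F(\lambda)G\}^2 r^2(\delta/2)M(\lambda).$$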

Note that only a condition on the local coherence (3.3) is required to obtain the result
of Theorem 1. However, even this condition can be too strong, because the bound on
"correlations" should be uniform over $j \in J(\lambda)$, $i \ne j$, cf. the definition of $\rho(\lambda)$. For example,
this excludes the cases where the "correlations" can be relatively large for a small number
of pairs $(i, j)$ and almost zero otherwise. To account for this situation, we suggest below
another version of Theorem 1. Instead of maximal local coherence, we introduce cumulative
local coherence defined by

$$\rho_*(\lambda) = \sum_{i\in J(\lambda)}\sum_{j>i}|\rho_M(i, j)|.$$
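
The two coherence numbers are straightforward to evaluate once the Gram matrix of the dictionary is available. The sketch below assumes the definitions $\rho(\lambda) = \max_{i\in J(\lambda)}\max_{j\ne i}|\rho_M(i,j)|$ and $\rho_*(\lambda) = \sum_{i\in J(\lambda)}\sum_{j>i}|\rho_M(i,j)|$ with $\rho_M(i,j) = \langle f_i,f_j\rangle/(\|f_i\|\|f_j\|)$; the function names and the 0-based `support` list standing in for $J(\lambda)$ are hypothetical.

```python
import numpy as np

def correlations(gram):
    """Normalized Gram matrix: rho_M(i, j) = <f_i, f_j> / (||f_i|| ||f_j||)."""
    gram = np.asarray(gram, dtype=float)
    norms = np.sqrt(np.diag(gram))
    return gram / np.outer(norms, norms)

def maximal_local_coherence(gram, support):
    """rho(lambda): largest |rho_M(i, j)| over i in the support and j != i.

    `support` must be a non-empty list of 0-based indices.
    """
    rho = np.abs(correlations(gram))
    np.fill_diagonal(rho, 0.0)  # exclude the trivial pairs j == i
    return max(rho[i].max() for i in support)

def cumulative_local_coherence(gram, support):
    """rho_*(lambda): sum of |rho_M(i, j)| over i in the support and j > i."""
    rho = np.abs(correlations(gram))
    M = rho.shape[0]
    return sum(rho[i, j] for i in support for j in range(i + 1, M))
```

A dictionary with a few strongly correlated pairs can have a large maximal coherence but a small cumulative coherence, which is the situation the second version of the theorem is meant to cover.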
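Theorem 2. Assume that $L_j < \infty$ for $1 \le j \le M$. Then with probability at least $1 - \delta$, for
all $\lambda \in \mathbb{R}^M$ that satisfy

(3.4)   $16\,F(\lambda)G\rho_*(\lambda)\sqrt{M(\lambda)} \le 1$

and all $\alpha > 1$, we have the following oracle inequality:

$$\|f^\spadesuit - f\|^2 + \frac{\alpha}{2(\alpha - 1)}\sum_{j=1}^M \omega_j|\hat\lambda_j - \lambda_j| \le \frac{\alpha + 1}{\alpha - 1}\|f_\lambda - f\|^2 + \frac{8\alpha^2}{\alpha - 1}\{F(\lambda)G\}^2 r^2(\delta/2)M(\lambda).$$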

Theorem 2 is useful when we deal with sparse Gram matrices $\Psi_M = (\langle f_i, f_j\rangle)_{1\le i,j\le M}$
that have only a small number $N$ of non-zero off-diagonal entries. This number will be called
a sparsity index of the matrix $\Psi_M$, and is defined as

$$N = |\{(i, j) : i, j \in \{1, \dots, M\},\ i > j \text{ and } \psi_M(i, j) \ne 0\}|,$$

where $\psi_M(i, j)$ is the $(i, j)$th entry of $\Psi_M$ and $|A|$ denotes the cardinality of a set $A$. Clearly,
$N < M(M + 1)/2$. We therefore obtain the following immediate corollary of Theorem 2.

Corollary 1. Let $\Psi_M$ be a Gram matrix with sparsity index $N$. Then the assertion of
Theorem 2 holds if we replace there (3.4) by the condition

(3.5)   $16\,F(\lambda)N\sqrt{M(\lambda)} \le 1.$

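We finally give an oracle inequality, which is valid under the assumption that the Gram
matrix $\Psi_M$ is positive definite. It is simpler to use than the above results when the dictionary
is orthonormal or forms a frame. Note that the coherence assumptions considered above do
not necessarily imply the positive definiteness of $\Psi_M$. Vice versa, the positive definiteness of
$\Psi_M$ does not imply these assumptions.

Theorem 3. Assume that $L_j < \infty$ for $1 \le j \le M$ and that the Gram matrix $\Psi_M$ is positive
definite with minimal eigenvalue larger than or equal to $\kappa_M > 0$. Then, with probability at
least $1 - \delta$, for all $\alpha > 1$ and all $\lambda \in \mathbb{R}^M$ we have

(3.6)   $\|f^\spadesuit - f\|^2 + \frac{\alpha}{\alpha - 1}\sum_{j=1}^M \omega_j|\hat\lambda_j - \lambda_j| \le \frac{\alpha + 1}{\alpha - 1}\|f_\lambda - f\|^2 + \frac{8\alpha^2}{\alpha - 1}\,\frac{1}{\kappa_M}\sum_{j\in J(\lambda)}\omega_j^2,$

where, for the choice (3.1) of the weights,

$$\sum_{j\in J(\lambda)}\omega_j^2 = \frac{16\log(2M/\delta)}{n}\sum_{j\in J(\lambda)} L_j^2.$$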
We can consider some other choices for $\omega_j$ without affecting the previous results. For
instance,

(3.7)   $\omega_j = 2\sqrt{2}\,\sigma_j r(\delta/2) + \frac{8}{3}L_j r^2(\delta/2)$

or

(3.8)   $\omega_j = 2\sqrt{2}\,T_j r(\delta/2) + \frac{8}{3}L_j r^2(\delta/2)$

with

$$T_j^2 = \frac{2}{n}\sum_{i=1}^n f_j^2(X_i) + 2L_j^2 r^2(\delta/2)$$

yield the same conclusions. These modifications of (3.1) prove useful, for example, in situations
where the $f_j$ are wavelet basis functions, cf. Section 5. The choice (3.8) of $\omega_j$ has the
advantage of being completely data-driven.




Theorem 4. Theorems 1–3 and Corollary 1 hold with the choices (3.7) or (3.8) for the
weights $\omega_j$ without changing the assertions. They also remain valid if we replace these $\omega_j$ by
any $\omega_j'$ such that $\omega_j' > \omega_j$.

If $\omega_j$ is chosen as in (3.8), our bounds on the risk of the SPADES estimator involve the random
variables $(1/n)\sum_{i=1}^n f_j^2(X_i)$. These can be replaced in the bounds by deterministic values
using the following lemma.

Lemma 1. Assume that $L_j < \infty$ for $j = 1, \dots, M$. Then

(3.9)   $P\Big(\frac{1}{n}\sum_{i=1}^n f_j^2(X_i) \le 2\mathbb{E}f_j^2(X_1) + \frac{4}{3}L_j^2 r^2(\delta/2),\ \forall j = 1, \dots, M\Big) \ge 1 - \delta/2.$

From Theorem 3 and Lemma 1 we find that, for the choice of $\omega_j$ as in (3.8), the oracle
inequalities of Theorems 1–3 and Corollary 1 remain valid with probability at least $1 - 3\delta/2$
if we replace the $\omega_j$ in these inequalities by the expressions $2\sqrt{2}\,\bar T_j r(\delta/2) + (8/3)L_j r^2(\delta/2)$
where $\bar T_j = \big(2\mathbb{E}f_j^2(X_1) + (4/3)L_j^2 r^2(\delta/2)\big)^{1/2}$.



3.3. Proofs. We first prove the following preliminary lemma. Define the random variables

$$V_j = \frac{1}{n}\sum_{i=1}^n\{f_j(X_i) - \mathbb{E}f_j(X_i)\}$$

and the event

(3.10)   $\mathcal{A} = \bigcap_{j=1}^M\{2|V_j| \le \omega_j\}.$
Lemma 2. Assume that $L_j < \infty$ for $j = 1, \dots, M$. Then for all $\lambda \in \mathbb{R}^M$ we have that, on
the event $\mathcal{A}$,

(3.11)   $\|f^\spadesuit - f\|^2 + \sum_{j=1}^M \omega_j|\hat\lambda_j - \lambda_j| \le \|f_\lambda - f\|^2 + 4\sum_{j\in J(\lambda)}\omega_j|\hat\lambda_j - \lambda_j|.$


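Proof. By the definition of $\hat\lambda$,

$$-\frac{2}{n}\sum_{i=1}^n f_{\hat\lambda}(X_i) + \|f_{\hat\lambda}\|^2 + 2\sum_{j=1}^M\omega_j|\hat\lambda_j| \le -\frac{2}{n}\sum_{i=1}^n f_\lambda(X_i) + \|f_\lambda\|^2 + 2\sum_{j=1}^M\omega_j|\lambda_j|$$

for all $\lambda \in \mathbb{R}^M$. We rewrite this inequality as

$$\|f^\spadesuit - f\|^2 \le \|f_\lambda - f\|^2 - 2\langle f, f^\spadesuit - f_\lambda\rangle + \frac{2}{n}\sum_{i=1}^n(f^\spadesuit - f_\lambda)(X_i) + 2\sum_{j=1}^M\omega_j|\lambda_j| - 2\sum_{j=1}^M\omega_j|\hat\lambda_j|$$

$$= \|f_\lambda - f\|^2 + 2\sum_{j=1}^M\Big(\frac{1}{n}\sum_{i=1}^n f_j(X_i) - \mathbb{E}f_j(X_i)\Big)(\hat\lambda_j - \lambda_j) + 2\sum_{j=1}^M\omega_j|\lambda_j| - 2\sum_{j=1}^M\omega_j|\hat\lambda_j|.$$

Then, on the event $\mathcal{A}$,

$$\|f^\spadesuit - f\|^2 \le \|f_\lambda - f\|^2 + \sum_{j=1}^M\omega_j|\hat\lambda_j - \lambda_j| + 2\sum_{j=1}^M\omega_j|\lambda_j| - 2\sum_{j=1}^M\omega_j|\hat\lambda_j|.$$

Add $\sum_j\omega_j|\hat\lambda_j - \lambda_j|$ to both sides of the inequality to obtain

$$\|f^\spadesuit - f\|^2 + \sum_{j=1}^M\omega_j|\hat\lambda_j - \lambda_j| \le \|f_\lambda - f\|^2 + 2\sum_{j=1}^M\omega_j|\hat\lambda_j - \lambda_j| + 2\sum_{j=1}^M\omega_j|\lambda_j| - 2\sum_{j=1}^M\omega_j|\hat\lambda_j|$$

$$\le \|f_\lambda - f\|^2 + 2\sum_{j\in J(\lambda)}\omega_j|\hat\lambda_j - \lambda_j| + 2\sum_{j=1}^M\omega_j|\lambda_j| - 2\sum_{j\in J(\lambda)}\omega_j|\hat\lambda_j|$$

$$\le \|f_\lambda - f\|^2 + 4\sum_{j\in J(\lambda)}\omega_j|\hat\lambda_j - \lambda_j|,$$

where we used that $\lambda_j = 0$ for $j \notin J(\lambda)$ and the triangle inequality. □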
For the choice (3.1) for $\omega_j$, we find by Hoeffding's inequality for sums of independent
random variables $\zeta_{ij} = f_j(X_i) - \mathbb{E}f_j(X_i)$ with $|\zeta_{ij}| \le 2L_j$ that

FiA) < ^P{2|y,|>..,}<2^exp(^-^^j =5. 



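Proof of Theorem 1. In view of Lemma 2 we need to bound $\sum_{j\in J(\lambda)}\omega_j|\hat\lambda_j - \lambda_j|$. Set

$$u_j = \hat\lambda_j - \lambda_j, \qquad U(\lambda) = \sum_{j\in J(\lambda)}|u_j|\,\|f_j\|, \qquad U = \sum_{j=1}^M|u_j|\,\|f_j\|.$$

Then, by the definition of $F(\lambda)$ and the Cauchy-Schwarz inequality,

$$\sum_{j\in J(\lambda)}\omega_j|\hat\lambda_j - \lambda_j| \le rF(\lambda)U(\lambda).$$

Since

$$\sum_{i,j\notin J(\lambda)}\langle f_i, f_j\rangle u_i u_j \ge 0,$$

we obtain

(3.12)   $\sum_{j\in J(\lambda)}u_j^2\|f_j\|^2 = \|f^\spadesuit - f_\lambda\|^2 - \sum_{i,j\notin J(\lambda)}u_i u_j\langle f_i, f_j\rangle - 2\sum_{i\notin J(\lambda)}\sum_{j\in J(\lambda)}u_i u_j\langle f_i, f_j\rangle - \sum_{i,j\in J(\lambda),\, i\ne j}u_i u_j\langle f_i, f_j\rangle$

$$\le \|f^\spadesuit - f_\lambda\|^2 + 2\rho(\lambda)U(\lambda)U - \rho(\lambda)U^2(\lambda).$$

The left-hand side can be bounded by $\sum_{j\in J(\lambda)}u_j^2\|f_j\|^2 \ge U^2(\lambda)/M(\lambda)$ using the
Cauchy-Schwarz inequality, and we obtain that

$$U^2(\lambda) \le \|f^\spadesuit - f_\lambda\|^2 M(\lambda) + 2\rho(\lambda)M(\lambda)U(\lambda)U,$$

which immediately implies

(3.13)   $U(\lambda) \le 2\rho(\lambda)M(\lambda)U + \sqrt{M(\lambda)}\,\|f^\spadesuit - f_\lambda\|.$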
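Hence, by Lemma 2, we have with probability at least $1 - \delta$,

$$\|f^\spadesuit - f\|^2 + \sum_{j=1}^M\omega_j|\hat\lambda_j - \lambda_j| \le \|f_\lambda - f\|^2 + 4\sum_{j\in J(\lambda)}\omega_j|\hat\lambda_j - \lambda_j|$$

$$\le \|f_\lambda - f\|^2 + 4rF(\lambda)U(\lambda)$$

$$\le \|f_\lambda - f\|^2 + 4rF(\lambda)\big\{2\rho(\lambda)M(\lambda)U + \sqrt{M(\lambda)}\,\|f^\spadesuit - f_\lambda\|\big\}$$

$$\le \|f_\lambda - f\|^2 + 8F(\lambda)\rho(\lambda)M(\lambda)G\sum_{j=1}^M\omega_j|\hat\lambda_j - \lambda_j| + 4rF(\lambda)\sqrt{M(\lambda)}\,\|f^\spadesuit - f_\lambda\|,$$

where $r = r(\delta/2)$. For all $\lambda \in \mathbb{R}^M$ that satisfy relation (3.3), we find that with probability
exceeding $1 - \delta$,

$$\|f^\spadesuit - f\|^2 + \frac{1}{2}\sum_{j=1}^M\omega_j|\hat\lambda_j - \lambda_j| \le \|f_\lambda - f\|^2 + 4rF(\lambda)G\sqrt{M(\lambda)}\,\|f^\spadesuit - f_\lambda\|$$

$$\le \|f_\lambda - f\|^2 + 2\big\{2rGF(\lambda)\sqrt{M(\lambda)}\big\}\|f^\spadesuit - f\| + 2\big\{2rGF(\lambda)\sqrt{M(\lambda)}\big\}\|f_\lambda - f\|.$$

After applying the inequality $2xy \le x^2/\alpha + \alpha y^2$ ($x, y \in \mathbb{R}$, $\alpha > 0$) for each of the last two
summands, we easily find the claim. □

Proof of Theorem 2. The proof is similar to that of Theorem 1. With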



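$$U_*(\lambda) = \sqrt{\sum_{j\in J(\lambda)}u_j^2\|f_j\|^2}$$

we obtain now the following analogue of (3.12):

$$U_*^2(\lambda) \le \|f^\spadesuit - f_\lambda\|^2 + 2\rho_*(\lambda)\max_{i\in J(\lambda),\, j>i}|u_i|\,\|f_i\|\,|u_j|\,\|f_j\|$$

$$\le \|f^\spadesuit - f_\lambda\|^2 + 2\rho_*(\lambda)U_*(\lambda)\sum_{j=1}^M|u_j|\,\|f_j\|$$

$$= \|f^\spadesuit - f_\lambda\|^2 + 2\rho_*(\lambda)U_*(\lambda)U.$$

Hence, as in the proof of Theorem 1, we have

$$U_*(\lambda) \le 2\rho_*(\lambda)U + \|f^\spadesuit - f_\lambda\|,$$

and using the inequality $U_*(\lambda) \ge U(\lambda)/\sqrt{M(\lambda)}$ we find

(3.14)   $U(\lambda) \le 2\rho_*(\lambda)\sqrt{M(\lambda)}\,U + \sqrt{M(\lambda)}\,\|f^\spadesuit - f_\lambda\|.$

Note that (3.14) differs from (3.13) only in the fact that the factor $2\rho(\lambda)M(\lambda)$ on the right
hand side is now replaced by $2\rho_*(\lambda)\sqrt{M(\lambda)}$. Up to this modification, the rest of the proof is
identical to that of Theorem 1. □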

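Proof of Theorem 3. By the assumption on $\Psi_M$ we have

$$\|f_\lambda\|^2 = \sum_{1\le i,j\le M}\lambda_i\lambda_j\int_{\mathbb{R}^d}f_i(x)f_j(x)\,dx \ge \kappa_M\sum_{j\in J(\lambda)}\lambda_j^2.$$

By the Cauchy-Schwarz inequality, we find

$$4\sum_{j\in J(\lambda)}\omega_j|\hat\lambda_j - \lambda_j| \le 4\sqrt{\sum_{j\in J(\lambda)}\omega_j^2}\,\sqrt{\sum_{j\in J(\lambda)}|\hat\lambda_j - \lambda_j|^2} \le 4\Big(\frac{\sum_{j\in J(\lambda)}\omega_j^2}{\kappa_M}\Big)^{1/2}\|f^\spadesuit - f_\lambda\|.$$

Combination with Lemma 2 yields that, with probability at least $1 - \delta$,

(3.15)   $\|f^\spadesuit - f\|^2 + \sum_{j=1}^M\omega_j|\hat\lambda_j - \lambda_j| \le \|f_\lambda - f\|^2 + 4\Big(\frac{\sum_{j\in J(\lambda)}\omega_j^2}{\kappa_M}\Big)^{1/2}\|f^\spadesuit - f_\lambda\|$

$$\le \|f_\lambda - f\|^2 + b\big(\|f^\spadesuit - f\| + \|f_\lambda - f\|\big),$$

where $b = 4\sqrt{\sum_{j\in J(\lambda)}\omega_j^2}\big/\sqrt{\kappa_M}$. Applying the inequality $2xy \le x^2/\alpha + \alpha y^2$ ($x, y \in \mathbb{R}$, $\alpha > 0$)
for each of the last two summands in (3.15) we get the result. □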

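Proof of Theorem 4. Write $\bar\omega_j = 2\sqrt{2}\,\sigma_j r(\delta/2) + (8/3)L_j r^2(\delta/2)$ for the choice of $\omega_j$ in (3.7).
Using Bernstein's exponential inequality for sums of independent random variables
$\zeta_{ij} = f_j(X_i) - \mathbb{E}f_j(X_i)$ with $|\zeta_{ij}| \le 2L_j$, we obtain that

(3.16)   $P\Big(\bigcup_{j=1}^M\{2|V_j| > \bar\omega_j\}\Big) \le \sum_{j=1}^M P\{2|V_j| > \bar\omega_j\} \le \sum_{j=1}^M\exp\Big(-\frac{n\bar\omega_j^2/4}{2\mathrm{Var}(f_j(X_1)) + 2L_j\bar\omega_j/3}\Big)$

$$\le M\exp(-nr^2(\delta/2)) = \delta/2.$$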
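Let now $\omega_j$ be defined by (3.8). Then, using (3.16), we can write

(3.17)   $P\Big(\bigcup_{j=1}^M\{2|V_j| > \omega_j\}\Big) \le \sum_{j=1}^M P\{2|V_j| > \bar\omega_j\} + \sum_{j=1}^M P\{\bar\omega_j > \omega_j\} \le \delta/2 + \sum_{j=1}^M P\{\bar\omega_j > \omega_j\}.$

Define

$$t_j = 2\,\frac{\mathbb{E}f_j^4(X_1)}{\mathbb{E}f_j^2(X_1)}\,\frac{\log(2M/\delta)}{n}$$

and note that

$$\frac{2}{n}\sum_{i=1}^n f_j^2(X_i) + t_j \le T_j^2.$$

Then

$$P\{\bar\omega_j > \omega_j\} = P\{\mathrm{Var}(f_j(X_1)) > T_j^2\} \le P\Big\{\mathbb{E}f_j^2(X_1) > \frac{2}{n}\sum_{i=1}^n f_j^2(X_i) + t_j\Big\}$$

$$\le \exp\Big(-\frac{n\{\mathbb{E}f_j^2(X_1) + t_j\}^2}{8\,\mathbb{E}f_j^4(X_1)}\Big) \quad\text{using Proposition 2.6 in [38]}$$

$$\le \exp\Big(-\frac{nt_j\,\mathbb{E}f_j^2(X_1)}{2\,\mathbb{E}f_j^4(X_1)}\Big) \quad\text{since } (x + y)^2 \ge 4xy,$$

which is at most $\delta/(2M)$. Plugging this in (3.17) concludes the proof.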
Proof of Lemma 1. Using Bernstein's exponential inequality for sums of independent random
variables $f_j^2(X_i) - \mathbb{E}f_j^2(X_i)$ and the fact that $\mathbb{E}f_j^4(X_1) \le L_j^2\,\mathbb{E}f_j^2(X_1)$ we find



4 r2^2 

< exp 



L--Jj V-'-i; I 3^j' 

< exp(-nr2(5/2)) = ^, 
which implies the lemma. □ 

4. Sparse estimation in mixture models 
In this section we assume that the true density $f$ can be represented as a finite mixture

$$f(x) = \sum_{j\in I^*}\lambda_j^* f_j(x),$$

where $I^* \subseteq \{1, \dots, M\}$ is unknown, $f_j$ are known probability densities and $\lambda_j^* \ne 0$ for all
$j \in I^*$. The focus of this section is on model selection, i.e., on the correct identification of
the set $I^*$. We set $\lambda^* = (\lambda_1^*, \dots, \lambda_M^*)$ where $\lambda_j^* = 0$, $j \notin I^*$.

For clarity of exposition we consider a simplified version of the general set-up introduced
above. We compute the estimates of $\lambda^*$ via (2.3), with weights defined by (cf. (3.1)):

$$\omega_j = 4Lr, \quad\text{for all } j,$$

where $r > 0$ is a constant that we specify below, and we have replaced
all $L_j = \|f_j\|_\infty$ by an upper bound $L$ on $\max_{1\le j\le M}L_j$. We assume that all $f_j$ have been
standardized to have $\|f_j\| = 1$. Note that under these assumptions condition (3.3) takes the
form

(4.1)   $\rho(\lambda) \le \frac{1}{16M(\lambda)}.$

We state (4.1) for the true vector $\lambda^*$ in the following form.
Condition (A).

$$\rho^* \le \frac{1}{16k^*},$$

where $k^* = |I^*| = M(\lambda^*)$ and $\rho^* = \rho(\lambda^*)$.
The results of Section 3 are valid for any $r$ larger than or equal to $r(\delta/2) = \{\log(2M/\delta)/n\}^{1/2}$.
They give bounds on the predictive performance of SPADES. As noted in, e.g., [7], for
$\ell_1$-penalized model selection in regression, the tuning sequence $\omega_j$ required for correct selection
is typically larger than the one that yields good prediction. We show below that the same
is true for selecting the components of a mixture of densities. Specifically, in this section we
will take the value

(4.2)   $r = r(M, n, \delta/(2M)) = \sqrt{\frac{\log(2M^2/\delta)}{n}}.$

We will use the following corollary of Theorem 1, obtained for $\alpha = \sqrt{2}$.

Corollary 2. Assume that Condition (A) holds. Then with probability at least $1 - \delta/M$ we
have

(4.3)   $\sum_{j=1}^M|\hat\lambda_j - \lambda_j^*| \le \frac{4\sqrt{2}}{L}\,rk^*.$

Inequality (4.3) guarantees that the estimate $\hat\lambda$ is close to the true $\lambda^*$ in $\ell_1$ norm, if the
number of mixture components $k^*$ is substantially smaller than $\sqrt{n}$. We regard this as an
intermediate step for the next result that deals with the identification of $I^*$.

4.1. Correct identification of the mixture components. We now show that $I^*$ can
be identified with probability close to 1 by our procedure. Let $\hat I = J(\hat\lambda)$ be the set of
indices of the non-zero components of $\hat\lambda$ given by (2.3). In what follows we investigate when
$P(\hat I = I^*) \ge 1 - \varepsilon$ for a given $0 < \varepsilon < 1$. Our results are non-asymptotic: they hold for any
fixed $M$ and $n$.

We need two conditions to ensure that correct recovery of $I^*$ is possible. The first one is
the identifiability of the model, as quantified by Condition (A) above. The second condition
requires that the weights of the mixture are above the noise level, quantified by $r$. We state
it as follows.

Condition (B).

$$\min_{j\in I^*}|\lambda_j^*| \ge 4(\sqrt{2} + 1)rL,$$

where $L = \max\big(1/\sqrt{3},\ \max_{1\le j\le M}L_j\big)$ and $r$ is given in (4.2).
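
To get a feel for the noise level in Condition (B), the sketch below evaluates the threshold $4(\sqrt{2}+1)rL$ with $r = \sqrt{\log(2M^2/\delta)/n}$ as in (4.2) and $L = \max(1/\sqrt{3}, \max_j L_j)$. The function names and the example inputs are hypothetical and serve only to illustrate the arithmetic.

```python
import math

def selection_rate(M, n, delta):
    """r = r(M, n, delta / (2M)) = sqrt(log(2 M^2 / delta) / n), as in (4.2)."""
    return math.sqrt(math.log(2.0 * M ** 2 / delta) / n)

def weight_threshold(sup_norms, n, delta):
    """Smallest mixture weight allowed by Condition (B): 4 (sqrt(2) + 1) r L."""
    M = len(sup_norms)
    L = max(1.0 / math.sqrt(3.0), max(sup_norms))
    return 4.0 * (math.sqrt(2.0) + 1.0) * selection_rate(M, n, delta) * L
```

Because the threshold scales like $\sqrt{\log M / n}$, quadrupling the sample size halves the smallest weight that the condition allows.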

Theorem 5. Let $0 < \delta < 1/2$ be a given number. Assume that Conditions (A) and (B) hold.
Then $P(\hat I = I^*) \ge 1 - 2\delta(1 + 1/M)$.
Proof. We begin by noticing that

$$P(\hat I \ne I^*) \le P(I^* \not\subseteq \hat I) + P(\hat I \not\subseteq I^*),$$

and we control each of the probabilities on the right hand side separately.

Control of $P(I^* \not\subseteq \hat I)$. By the definitions of the sets $\hat I$ and $I^*$ we have

$$P(I^* \not\subseteq \hat I) \le P(\hat\lambda_k = 0 \text{ for some } k \in I^*) \le k^*\max_{k\in I^*}P(\hat\lambda_k = 0).$$

We control the last probability by using the characterization (5.9) of $\hat\lambda$ given in Lemma 3 of
the Appendix. We also recall that $\mathbb{E}f_k(X_1) = \sum_{j\in I^*}\lambda_j^*\langle f_k, f_j\rangle = \sum_{j=1}^M\lambda_j^*\langle f_k, f_j\rangle$, since we
assumed that the density of $X_1$ is the mixture $f = \sum_{j\in I^*}\lambda_j^* f_j$. We therefore obtain, for
$k \in I^*$,



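$$P(\hat\lambda_k = 0) = P\Big(\Big|\frac{1}{n}\sum_{i=1}^n f_k(X_i) - \sum_{j=1}^M\hat\lambda_j\langle f_j, f_k\rangle\Big| \le 4rL;\ \hat\lambda_k = 0\Big)$$

$$= P\Big(\Big|\frac{1}{n}\sum_{i=1}^n f_k(X_i) - \mathbb{E}f_k(X_1) - \sum_{j=1}^M(\hat\lambda_j - \lambda_j^*)\langle f_j, f_k\rangle\Big| \le 4rL;\ \hat\lambda_k = 0\Big)$$

$$\le P\Big(|\lambda_k^*|\,\|f_k\|^2 - \Big|\frac{1}{n}\sum_{i=1}^n f_k(X_i) - \mathbb{E}f_k(X_1)\Big| - \Big|\sum_{j\ne k}(\hat\lambda_j - \lambda_j^*)\langle f_j, f_k\rangle\Big| \le 4rL\Big)$$

(4.4)   $\le P\Big(\Big|\frac{1}{n}\sum_{i=1}^n f_k(X_i) - \mathbb{E}f_k(X_1)\Big| \ge \frac{|\lambda_k^*|\,\|f_k\|^2}{2} - 2rL\Big)$

(4.5)   $\quad + P\Big(\Big|\sum_{j\ne k}(\hat\lambda_j - \lambda_j^*)\langle f_j, f_k\rangle\Big| \ge \frac{|\lambda_k^*|\,\|f_k\|^2}{2} - 2rL\Big).$
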



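To bound (4.4) we use Hoeffding's inequality, as in the course of Lemma 2. We first recall
that $\|f_k\| = 1$ for all $k$ and that, by Condition (B), $\min_{k\in I^*}|\lambda_k^*| \ge 4(\sqrt{2} + 1)Lr$, with
$r = r(\delta/(2M)) = \{\log(2M^2/\delta)/n\}^{1/2}$. Therefore

(4.6)   $P\Big(\Big|\frac{1}{n}\sum_{i=1}^n f_k(X_i) - \mathbb{E}f_k(X_1)\Big| \ge \frac{|\lambda_k^*|}{2} - 2rL\Big) \le P\Big(\Big|\frac{1}{n}\sum_{i=1}^n f_k(X_i) - \mathbb{E}f_k(X_1)\Big| \ge 2\sqrt{2}\,rL\Big) \le \frac{\delta}{M^2}.$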
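To bound (4.5) notice that, by Conditions (A) and (B),

$$P\Big(\Big|\sum_{j\ne k}(\hat\lambda_j - \lambda_j^*)\langle f_j, f_k\rangle\Big| \ge \frac{|\lambda_k^*|}{2} - 2rL\Big) \le P\Big(\sum_{j=1}^M|\hat\lambda_j - \lambda_j^*| \ge 32\sqrt{2}\,rLk^*\Big)$$

$$\le P\Big(\sum_{j=1}^M|\hat\lambda_j - \lambda_j^*| \ge \frac{4\sqrt{2}\,rk^*}{L}\Big) \le \frac{\delta}{M},$$

where the penultimate inequality holds since, by definition, $L^2 \ge 1/3$ and the last inequality
holds by Corollary 2.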



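Combining the above results we obtain

$$P(I^* \not\subseteq \hat I) \le k^*\Big(\frac{\delta}{M^2} + \frac{\delta}{M}\Big) \le \frac{\delta}{M} + \delta.$$

Control of $P(\hat I \not\subseteq I^*)$. Let

(4.7)   $h(\mu) = -\frac{2}{n}\sum_{i=1}^n\sum_{j\in I^*}\mu_j f_j(X_i) + \Big\|\sum_{j\in I^*}\mu_j f_j\Big\|^2 + 8rL\sum_{j\in I^*}|\mu_j|.$

Let

(4.8)   $\tilde\mu = \arg\min_{\mu\in\mathbb{R}^{k^*}} h(\mu).$

Consider the random event

(4.9)   $\mathcal{B} = \bigcap_{k\notin I^*}\Big\{\Big|\frac{1}{n}\sum_{i=1}^n f_k(X_i) - \sum_{j\in I^*}\tilde\mu_j\langle f_j, f_k\rangle\Big| \le 4Lr\Big\}.$

Let $\bar\mu \in \mathbb{R}^M$ be the vector that has the components of $\tilde\mu$ given by (4.8) in positions
corresponding to the index set $I^*$ and zero components elsewhere. By the first part of Lemma 3
in the Appendix we have that $\bar\mu \in \mathbb{R}^M$ is a solution of (2.3) on the event $\mathcal{B}$. Recall that $\hat\lambda$ is
also a solution of (2.3). By the definition of the set $\hat I$ we have that $\hat\lambda_k \ne 0$ for $k \in \hat I$. By
construction, $\tilde\mu_k \ne 0$ for $k$ in some subset $S \subseteq I^*$. By the second part of Lemma 3 in the Appendix,
any two solutions have non-zero elements in the same positions. Therefore $\hat I = S \subseteq I^*$ on $\mathcal{B}$.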




FLORENTINA BUNEA ALEXANDRE B. TSYBAKOV MARTEN H. WEGKAMP 



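Thus,

(4.10)   $P(\hat I \not\subseteq I^*) \le P(\mathcal{B}^c) \le \sum_{k\notin I^*}P\Big(\Big|\frac{1}{n}\sum_{i=1}^n f_k(X_i) - \sum_{j\in I^*}\tilde\mu_j\langle f_j, f_k\rangle\Big| \ge 4rL\Big)$

$$\le \sum_{k\notin I^*}P\Big(\Big|\frac{1}{n}\sum_{i=1}^n f_k(X_i) - \mathbb{E}f_k(X_1)\Big| \ge 2\sqrt{2}\,rL\Big) + \sum_{k\notin I^*}P\Big(\sum_{j\in I^*}|\tilde\mu_j - \lambda_j^*|\,|\langle f_j, f_k\rangle| \ge (4 - 2\sqrt{2})rL\Big).$$

Reasoning as in (4.6) above we find

$$\sum_{k\notin I^*}P\Big(\Big|\frac{1}{n}\sum_{i=1}^n f_k(X_i) - \mathbb{E}f_k(X_1)\Big| \ge 2\sqrt{2}\,rL\Big) \le \frac{\delta}{M}.$$

To bound the last sum in (4.10) we first notice that Theorem 1 (if we replace there $r(\delta/2)$
by the larger value $r(\delta/(2M))$; cf. Theorem 4) applies to $\tilde\mu$ given by (4.8). In particular

$$P\Big(\sum_{j\in I^*}|\tilde\mu_j - \lambda_j^*| \ge \frac{4\sqrt{2}\,rk^*}{L}\Big) \le \frac{\delta}{M}.$$

Therefore, by Condition (A), we have

$$\sum_{k\notin I^*}P\Big(\sum_{j\in I^*}|\tilde\mu_j - \lambda_j^*|\,|\langle f_j, f_k\rangle| \ge (4 - 2\sqrt{2})rL\Big) \le \sum_{k\notin I^*}P\Big(\sum_{j\in I^*}|\tilde\mu_j - \lambda_j^*| \ge 16(4 - 2\sqrt{2})k^*rL\Big)$$

$$\le \sum_{k\notin I^*}P\Big(\sum_{j\in I^*}|\tilde\mu_j - \lambda_j^*| \ge \frac{4\sqrt{2}\,rk^*}{L}\Big) \le \delta,$$

which holds since $L^2 \ge 1/3$. Collecting all the bounds above we obtain

$$P(\hat I \ne I^*) \le 2\delta + \frac{2\delta}{M},$$

which concludes the proof.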



4.2. Example: Identifying true components in mixtures of Gaussian densities.

Consider an ensemble of $M$ Gaussian densities $f_j$ in $\mathbb{R}^d$ with means $\mu_j$ and covariance
matrices $\tau_j^2 I_d$, where $I_d$ is the unit $d \times d$ matrix. In what follows we show that Condition
(A) holds if the means of the Gaussian densities are well separated, and we make this precise
below. Therefore, in this case, Theorem 5 guarantees that if the weights of the mixture are
above the threshold given in Condition (B), we can recover the true mixture components with
high probability via our procedure.
Recall that Condition (A) requires

$$\rho^* = \max_{i\in I^*,\, j\ne i}\frac{\langle f_i, f_j\rangle}{\|f_i\|\,\|f_j\|} \le \frac{1}{16k^*}.$$

The densities are

$$f_j(x) = (2\pi\tau_j^2)^{-d/2}\exp\Big(-\frac{\|x - \mu_j\|_2^2}{2\tau_j^2}\Big),$$

where $\|\cdot\|_2$ denotes the Euclidean norm. Let $\tau_{\max} = \max_{1\le j\le M}\tau_j$ and $D^2_{\min} = \min_{k\ne j}\|\mu_k - \mu_j\|_2^2$.
Via simple algebra we obtain

$$\rho^* \le \exp\Big(-\frac{D^2_{\min}}{4\tau^2_{\max}}\Big).$$

Therefore, Condition (A) holds if

$$D^2_{\min} \ge 4\tau^2_{\max}\log(16k^*).$$

Using this and Theorem 5 we see that SPADES can identify the true components in a
mixture of Gaussian densities if the squared Euclidean distance between any two means is
large enough as compared to the largest variance of the components in the mixture.
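
The "simple algebra" can be checked numerically. For isotropic Gaussians with covariances $\tau_i^2 I_d$ and $\tau_j^2 I_d$, a standard Gaussian integral gives $\langle f_i, f_j\rangle/(\|f_i\|\|f_j\|) = \{2\tau_i\tau_j/(\tau_i^2+\tau_j^2)\}^{d/2}\exp\{-\|\mu_i-\mu_j\|_2^2/(2(\tau_i^2+\tau_j^2))\}$, which is at most $\exp\{-D^2_{\min}/(4\tau^2_{\max})\}$. The sketch below evaluates this closed form and the separation condition $D^2_{\min} \ge 4\tau^2_{\max}\log(16k^*)$; the function names are hypothetical, and the closed form is a worked step supplied here under the assumption that $\tau_j$ denotes a standard deviation.

```python
import numpy as np

def gaussian_correlation(mu_i, mu_j, tau_i, tau_j):
    """Normalized L2 inner product of N(mu_i, tau_i^2 I) and N(mu_j, tau_j^2 I)."""
    mu_i = np.asarray(mu_i, dtype=float)
    mu_j = np.asarray(mu_j, dtype=float)
    d = mu_i.size
    s2 = tau_i ** 2 + tau_j ** 2
    dist2 = float(np.sum((mu_i - mu_j) ** 2))
    return float((2.0 * tau_i * tau_j / s2) ** (d / 2.0) * np.exp(-dist2 / (2.0 * s2)))

def min_squared_distance(means):
    """Smallest squared Euclidean distance between two distinct means."""
    means = np.asarray(means, dtype=float)  # shape (M, d)
    diffs = means[:, None, :] - means[None, :, :]
    d2 = np.sum(diffs ** 2, axis=-1)
    return float(d2[np.triu_indices(len(means), k=1)].min())

def separation_sufficient(means, taus, k_star):
    """Check D_min^2 >= 4 tau_max^2 log(16 k*), which implies Condition (A)."""
    tau_max = max(taus)
    return bool(min_squared_distance(means) >= 4.0 * tau_max ** 2 * np.log(16.0 * k_star))
```

The condition is sufficient but not necessary: it only uses the worst-case pair of means and the largest variance.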

Note that Condition (B) on the size of the mixture weights involves the constant $L$, which
in this example can be taken as

$$L = \max\Big(\frac{1}{\sqrt{3}},\ \max_{1\le j\le M}\|f_j\|_\infty\Big) = \max\Big(\frac{1}{\sqrt{3}},\ (2\pi\tau^2_{\min})^{-d/2}\Big),$$

where $\tau_{\min} = \min_{1\le j\le M}\tau_j$.

5. SPADES for adaptive nonparametric density estimation

We assume in this section that the density $f$ is defined on a bounded interval of $\mathbb{R}$ that
we take without loss of generality to be the interval $[0, 1]$. Consider a countable system of
functions $\{\psi_{lk},\ l \ge -1,\ k \in V(l)\}$ in $L_2$, where the set of indices $V(l)$ satisfies $|V(-1)| \le C$,
$2^l \le |V(l)| \le C2^l$, $l \ge 0$, for some constant $C$, and where the functions $\psi_{lk}$ satisfy

(5.1)   $\|\psi_{lk}\| \le C_1, \qquad \|\psi_{lk}\|_\infty \le C_1 2^{l/2}, \qquad \Big\|\sum_{k\in V(l)}\psi_{lk}^2\Big\|_\infty \le C_1 2^l,$

for all $l \ge -1$ and for some $C_1 < \infty$. Examples of such systems $\{\psi_{lk}\}$ are given, for instance,
by compactly supported wavelet bases, see, e.g., jT9j. In this case $\psi_{lk}(x) = 2^{l/2}\psi(2^l x - k)$
for some compactly supported function $\psi$. We assume that $\{\psi_{lk}\}$ is a frame, i.e., there exist
positive constants $c_1$ and $c_2$ depending only on $\{\psi_{lk}\}$ such that, for any two sequences of
coefficients $\beta_{lk}$, $\beta'_{lk}$,

(5.2)   $c_1\sum_{l=-1}^\infty\sum_{k\in V(l)}(\beta_{lk} - \beta'_{lk})^2 \le \Big\|\sum_{l=-1}^\infty\sum_{k\in V(l)}(\beta_{lk} - \beta'_{lk})\psi_{lk}\Big\|^2 \le c_2\sum_{l=-1}^\infty\sum_{k\in V(l)}(\beta_{lk} - \beta'_{lk})^2.$

If $\{\psi_{lk}\}$ is an orthonormal wavelet basis, this condition is satisfied with $c_1 = c_2 = 1$.

Now, choose $\{f_1, \dots, f_M\} = \{\psi_{lk},\ -1 \le l \le l_{\max},\ k \in V(l)\}$ where $l_{\max}$ is such that
$2^{l_{\max}} \asymp n/(\log n)$. Then also $M \asymp n/(\log n)$. The coefficients $\lambda_j$ are now indexed by $j = (l, k)$,
and we set by definition $\lambda_{(l,k)} = 0$ for $(l, k) \notin \{-1 \le l \le l_{\max},\ k \in V(l)\}$. Assume that there
exist coefficients $\beta^*_{lk}$ such that

$$f = \sum_{l=-1}^\infty\sum_{k\in V(l)}\beta^*_{lk}\psi_{lk},$$

where the series converges in $L_2$. Then Theorem 3 easily implies the following result.

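Theorem 6. Let $f_1, \dots, f_M$ be as defined above with $M \asymp n/(\log n)$, and let $\omega_j$ be given by
(3.8) for $\delta = n^{-2}$. Then for all $n \ge 1$, $\lambda \in \mathbb{R}^M$ we have with probability at least $1 - n^{-2}$,

(5.3)   $\|f^\spadesuit - f\|^2 \le K\Big(\sum_{l=-1}^\infty\sum_{k\in V(l)}(\lambda_{(l,k)} - \beta^*_{lk})^2 + \sum_{(l,k)\in J(\lambda)}\Big[\frac{1}{n}\sum_{i=1}^n\psi_{lk}^2(X_i)\,\frac{\log n}{n} + 2^l\Big(\frac{\log n}{n}\Big)^2\Big]\Big),$

where $K$ is a constant independent of $f$.
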

This is a general oracle inequality that allows one to show that the estimator $f^\spadesuit$ attains
minimax rates of convergence, up to a logarithmic factor, simultaneously on various functional
classes. We will explain this in detail for the case where $f$ belongs to a class of functions $\mathcal{F}$
satisfying the following assumption for some $s > 0$.

Condition (C). For any $f \in \mathcal{F}$ and any $l' \ge 0$ there exists a sequence of coefficients
$\lambda = \{\lambda_{(l,k)},\ -1 \le l \le l',\ k \in V(l)\}$ such that

(5.4)   $\sum_{l=-1}^\infty\sum_{k\in V(l)}(\lambda_{(l,k)} - \beta^*_{lk})^2 \le C_2 2^{-2l's}$

for a constant $C_2$ independent of $f$.

It is well known that Condition (C) holds for various functional classes $\mathcal{F}$, such as Hölder,
Sobolev, Besov classes, if $\{\psi_{lk}\}$ is an appropriately chosen wavelet basis, see, e.g., [12] and the
references cited therein. In this case $s$ is the smoothness parameter of the class. Moreover,
the basis $\{\psi_{lk}\}$ can be chosen so that Condition (C) is satisfied with $C_2$ independent of $s$ for
all $s \le s_{\max}$, where $s_{\max}$ is a given positive number. This allows for adaptation in $s$.
Under Condition (C) we obtain from (5.3) that, with probability at least $1 - n^{-2}$,



(5.5)   $\|f^\spadesuit - f\|^2 \le \min_{-1\le l'\le l_{\max}} K\Big(C_2 2^{-2l's} + \sum_{(l,k):\, l\le l'}\Big[\frac{1}{n}\sum_{i=1}^n\psi_{lk}^2(X_i)\,\frac{\log n}{n} + 2^l\Big(\frac{\log n}{n}\Big)^2\Big]\Big).$

From (5.5) and the last inequality in (5.1) we find for some constant $K'$, with probability at
least $1 - n^{-2}$,

(5.6)   $\|f^\spadesuit - f\|^2 \le \min_{-1\le l'\le l_{\max}} K'\Big(2^{-2l's} + 2^{l'}\,\frac{\log n}{n}\Big) = O\Big(\Big(\frac{\log n}{n}\Big)^{2s/(2s+1)}\Big),$

where the last expression is obtained by choosing $l'$ such that $2^{l'} \asymp (n/\log n)^{1/(2s+1)}$. It
follows from (5.6) that $f^\spadesuit$ converges with the optimal rate (up to a logarithmic factor)
simultaneously on all the functional classes satisfying Condition (C). Note that the definition
of the functional class is not used in the construction of the estimator $f^\spadesuit$, so this estimator
is optimal adaptive in the rate of convergence (up to a logarithmic factor) on this scale of
functional classes for $s \le s_{\max}$. Results of this type, and even sharper ones (without extra
logarithmic factors in the rate and sometimes with exact asymptotic minimax constants), are
known for various other adaptive density estimators, see, for instance, [16^ El [T9| [2T| [28l I29j
and the references cited therein. These papers consider classes of densities that are uniformly
bounded by a fixed constant, see the recent discussion in [5]. This prohibits, for example,
free scale transformations of densities within a class. Inequality (5.6) does not have this
drawback. It allows one to obtain the rates of convergence for classes of unbounded densities $f$ as
well.
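
The balancing argument behind the rate can be illustrated numerically. The sketch below assumes, as in the discussion above, that the approximation term is of order $2^{-2l's}$ and the stochastic term of order $2^{l'}\log n/n$, and it treats the resolution level as a real number rather than an integer; the function names are hypothetical.

```python
import math

def balancing_resolution(n, s):
    """Real-valued level l' solving 2**l' = (n / log n) ** (1 / (2s + 1))."""
    return math.log2(n / math.log(n)) / (2.0 * s + 1.0)

def bias_and_variance_terms(n, s, level):
    """Approximation term 2^(-2 l' s) and stochastic term 2^l' * log(n) / n."""
    bias = 2.0 ** (-2.0 * level * s)
    variance = 2.0 ** level * math.log(n) / n
    return bias, variance

def adaptive_rate(n, s):
    """Rate (log n / n) ** (2s / (2s + 1)) attained up to constants."""
    return (math.log(n) / n) ** (2.0 * s / (2.0 * s + 1.0))
```

At the balancing level both terms coincide with `adaptive_rate(n, s)`; in practice $l'$ is rounded to an integer, which changes the bound only by a constant factor.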



Another example is given by the classes of sparse densities defined as follows:

$\mathcal{L}_0(m) = \big\{ f : [0,1] \to \mathbb{R} \,:\, f \text{ is a probability density and } |\{ j : \langle f, f_j \rangle \ne 0 \}| \le m \big\}$,

where $m \le M$ is an unknown integer. If $f_1, \ldots, f_M$ is a wavelet system as defined above and
$J^* = \{ j = (l,k) : \langle f, f_j \rangle \ne 0 \}$, then under the conditions of Theorem 6 for any $f \in \mathcal{L}_0(m)$
we have, with probability at least $1 - n^{-2}$,

(5.7) $\|f^{\spadesuit} - f\|^2 \le K \Big( \sum_{(l,k) \in J^*} \Big[ \frac{1}{n} \sum_{i=1}^{n} \psi_{lk}^2(X_i)\, \frac{\log n}{n} + 2^{l} \Big( \frac{\log n}{n} \Big)^{2} \Big] \Big)$.

From (5.7), using Lemma 1 and the first two inequalities in (5.1), we obtain the following
result.

Corollary 3. Let the assumptions of Theorem 6 hold. Then, for every $L < \infty$ and $n \ge 1$,

(5.8) $\sup_{f \in \mathcal{L}_0(m) \cap \{ f : \|f\|_\infty \le L \}} \mathbb{P}\Big\{ \|f^{\spadesuit} - f\|^2 \ge b \Big( \frac{m \log n}{n} \Big) \Big\} \le (3/2)\, n^{-2}, \quad \forall\, m \le M$,

where $b > 0$ is a constant depending only on $L$.

Corollary 3 can be viewed as an analogue for density estimation of the adaptive minimax
results for $\mathcal{L}_0$ classes obtained in the Gaussian sequence model [1], [17] and in the random
design regression model [10].



Appendix 

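Lemma 3. (I) Let $\tilde{\mu}$ be given by (4.8). Then $\bar{\mu} = (\tilde{\mu}, 0) \in \mathbb{R}^M$ is a minimizer in $\lambda \in \mathbb{R}^M$ of

$g(\lambda) = -\frac{2}{n} \sum_{i=1}^{n} f_\lambda(X_i) + \|f_\lambda\|^2 + 8Lr \sum_{k=1}^{M} |\lambda_k|$

on the random event $B$ defined in (4.9).

(II) Any two minimizers of $g(\lambda)$ have non-zero components in the same positions.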

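Proof. (I). Since $g$ is convex, by standard results in convex analysis, $\lambda \in \mathbb{R}^M$ is a minimizer
of $g$ if and only if $0 \in D_\lambda$, where $D_\lambda$ is the subdifferential of $g(\lambda)$:

$D_\lambda = \Big\{ w \in \mathbb{R}^M : w_k = -\frac{2}{n} \sum_{i=1}^{n} f_k(X_i) + 2 \sum_{j=1}^{M} \lambda_j \langle f_j, f_k \rangle + 8 r v_k, \ v_k \in V_k(\lambda_k), \ 1 \le k \le M \Big\}$,

where

$V_k(\lambda_k) = \begin{cases} \{L\} & \text{if } \lambda_k > 0, \\ \{-L\} & \text{if } \lambda_k < 0, \\ [-L, L] & \text{if } \lambda_k = 0. \end{cases}$

Therefore, $\lambda$ minimizes $g(\cdot)$ if and only if, for all $1 \le k \le M$,

(5.9) $\frac{1}{n} \sum_{i=1}^{n} f_k(X_i) - \sum_{j=1}^{M} \lambda_j \langle f_j, f_k \rangle = 4Lr\, \mathrm{sign}(\lambda_k)$, if $\lambda_k \ne 0$,

(5.10) $\Big| \frac{1}{n} \sum_{i=1}^{n} f_k(X_i) - \sum_{j=1}^{M} \lambda_j \langle f_j, f_k \rangle \Big| \le 4Lr$, if $\lambda_k = 0$.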



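We now show that $\bar{\mu} = (\tilde{\mu}, 0) \in \mathbb{R}^M$ with $\tilde{\mu}$ given in (4.8) satisfies (5.9)-(5.10) on the event
$B$ and therefore is a minimizer of $g(\lambda)$ on this event. Indeed, since $\tilde{\mu}$ is a minimizer of the
convex function $h(\mu)$ given in (4.7), the same convex analysis argument as above implies that

$\frac{1}{n} \sum_{i=1}^{n} f_k(X_i) - \sum_{j \in I^*} \tilde{\mu}_j \langle f_j, f_k \rangle = 4Lr\, \mathrm{sign}(\tilde{\mu}_k)$, if $\tilde{\mu}_k \ne 0$, $k \in I^*$,

$\Big| \frac{1}{n} \sum_{i=1}^{n} f_k(X_i) - \sum_{j \in I^*} \tilde{\mu}_j \langle f_j, f_k \rangle \Big| \le 4Lr$, if $\tilde{\mu}_k = 0$, $k \in I^*$.

Note that on the event $B$ we also have

$\Big| \frac{1}{n} \sum_{i=1}^{n} f_k(X_i) - \sum_{j \in I^*} \tilde{\mu}_j \langle f_j, f_k \rangle \Big| \le 4Lr$, if $k \notin I^*$ (for which $\bar{\mu}_k = 0$, by construction).

Here $\bar{\mu}_k$ denotes the $k$th coordinate of $\bar{\mu}$. The above three displays and the fact that $\bar{\mu}_k =
\tilde{\mu}_k$, $k \in I^*$, show that $\bar{\mu}$ satisfies conditions (5.9)-(5.10) and is therefore a minimizer of $g(\lambda)$
on the event $B$.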

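(II). We now prove the second assertion of the lemma. In view of (5.9) the index set $S$ of the
non-zero components of any minimizer $\bar{\lambda}$ of $g(\lambda)$ satisfies

$S = \Big\{ k \in \{1, \ldots, M\} : \Big| \frac{1}{n} \sum_{i=1}^{n} f_k(X_i) - \sum_{j=1}^{M} \bar{\lambda}_j \langle f_j, f_k \rangle \Big| = 4rL \Big\}$.

Therefore, if for any two minimizers $\bar{\lambda}^{(1)}$ and $\bar{\lambda}^{(2)}$ of $g(\lambda)$ we have

(5.11) $\sum_{j=1}^{M} \big( \bar{\lambda}_j^{(1)} - \bar{\lambda}_j^{(2)} \big) \langle f_j, f_k \rangle = 0$, for all $k$,

then $S$ is the same for all minimizers of $g(\lambda)$.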

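Thus, it remains to show (5.11). We use simple properties of convex functions. First, we
recall that the set of minima of a convex function is convex. Then, if $\bar{\lambda}^{(1)}$ and $\bar{\lambda}^{(2)}$ are two
distinct points of minima, so is $\rho \bar{\lambda}^{(1)} + (1 - \rho) \bar{\lambda}^{(2)}$, for any $0 < \rho < 1$. Rewrite this convex
combination as $\bar{\lambda}^{(2)} + \rho \eta$, where $\eta = \bar{\lambda}^{(1)} - \bar{\lambda}^{(2)}$. Recall that the minimum value of any convex
function is unique. Therefore, for any $0 < \rho < 1$, the value of $g(\lambda)$ at $\lambda = \bar{\lambda}^{(2)} + \rho \eta$ is equal to
some constant $C$:

$F(\rho) \triangleq -\frac{2}{n} \sum_{i=1}^{n} \sum_{j=1}^{M} \big( \bar{\lambda}_j^{(2)} + \rho \eta_j \big) f_j(X_i) + \int \Big( \sum_{j=1}^{M} \big( \bar{\lambda}_j^{(2)} + \rho \eta_j \big) f_j(x) \Big)^{2} dx + 8rL \sum_{j=1}^{M} \big| \bar{\lambda}_j^{(2)} + \rho \eta_j \big| = C$.

By taking the derivative of $F(\rho)$ with respect to $\rho$ we obtain that, for all $0 < \rho < 1$,

(5.12) $F'(\rho) = -\frac{2}{n} \sum_{i=1}^{n} \sum_{j=1}^{M} \eta_j f_j(X_i) + 8rL \sum_{j=1}^{M} \eta_j\, \mathrm{sign}\big( \bar{\lambda}_j^{(2)} + \rho \eta_j \big) + 2 \int \Big( \sum_{j=1}^{M} \big( \bar{\lambda}_j^{(2)} + \rho \eta_j \big) f_j(x) \Big) \Big( \sum_{j=1}^{M} \eta_j f_j(x) \Big) dx = 0$.

By continuity of $\rho \mapsto \bar{\lambda}_j^{(2)} + \rho \eta_j$, there exists an open interval in $(0, 1)$ on which $\rho \mapsto \mathrm{sign}(\bar{\lambda}_j^{(2)} +
\rho \eta_j)$ is constant for all $j$. Therefore, on that interval,

$F'(\rho) = 2\rho \int \Big( \sum_{j=1}^{M} \eta_j f_j(x) \Big)^{2} dx + C'$,

where $C'$ does not depend on $\rho$. This is compatible with $F'(\rho) = 0$, $\forall\, 0 < \rho < 1$ (cf. (5.12)),
only if

$\sum_{j=1}^{M} \eta_j f_j(x) = 0$, for all $x$,

and therefore

$\sum_{j=1}^{M} \eta_j \langle f_j, f_k \rangle = 0$, for all $k \in \{1, \ldots, M\}$,

which is the desired result. This completes the proof of the lemma. □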
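The optimality conditions (5.9)-(5.10) used in the proof can be verified numerically on a toy problem. The sketch below is only an illustration under stated assumptions, not the authors' procedure: the Gram matrix `G` (standing for the inner products $\langle f_j, f_k \rangle$), the vector `c` (standing for the empirical means $n^{-1} \sum_i f_k(X_i)$) and the value `tau` (standing for $4Lr$) are hypothetical numbers, and coordinate descent with soft-thresholding is a standard solver substituted here for convenience. With this notation, $g(\lambda) = -2 c^\top \lambda + \lambda^\top G \lambda + 2\tau \sum_k |\lambda_k|$.

```python
import math


def soft_threshold(z, t):
    """Return sign(z) * max(|z| - t, 0)."""
    return math.copysign(max(abs(z) - t, 0.0), z)


def minimize_g(c, G, tau, sweeps=500):
    """Coordinate descent for g(lam) = -2 c.lam + lam' G lam + 2 tau ||lam||_1."""
    M = len(c)
    lam = [0.0] * M
    for _ in range(sweeps):
        for k in range(M):
            # Partial residual: c_k minus the contribution of the other coordinates.
            rho = c[k] - sum(G[k][j] * lam[j] for j in range(M) if j != k)
            # Exact minimizer of g in coordinate k with the others held fixed.
            lam[k] = soft_threshold(rho, tau) / G[k][k]
    return lam


def kkt_residuals(lam, c, G):
    """Left-hand sides of (5.9)-(5.10): c_k - sum_j lam_j G[j][k]."""
    M = len(c)
    return [c[k] - sum(G[k][j] * lam[j] for j in range(M)) for k in range(M)]


def satisfies_kkt(lam, c, G, tau, tol=1e-8):
    """Check (5.9) on non-zero coordinates and (5.10) on zero coordinates."""
    for lam_k, res_k in zip(lam, kkt_residuals(lam, c, G)):
        if lam_k != 0.0:
            if abs(res_k - tau * math.copysign(1.0, lam_k)) > tol:
                return False
        elif abs(res_k) > tau + tol:
            return False
    return True


# Hypothetical toy inputs (not taken from the paper).
G = [[1.0, 0.3, 0.0],
     [0.3, 1.0, 0.2],
     [0.0, 0.2, 1.0]]
c = [0.9, 0.5, 0.05]
tau = 0.1

lam_hat = minimize_g(c, G, tau)

if __name__ == "__main__":
    print("minimizer:", lam_hat)
    print("KKT conditions hold:", satisfies_kkt(lam_hat, c, G, tau))
```

In this toy example the third coordinate is set exactly to zero, and the corresponding residual lies strictly inside $[-\tau, \tau]$, which is the situation described by (5.10).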